Add support to load from any HF dataset and CSV #379

Merged 3 commits on Feb 13, 2024

krypticmouse
Collaborator

Code Snippet:

dl = DataLoader(
    train_size=0.5,
    dev_size=0.2,
    test_size=0.3,
)

dl.from_huggingface("databricks/databricks-dolly-15k")

print(len(dl.train))
print(len(dl.dev))
print(len(dl.test))

        self._process_dataset(dataset, fields)

    def from_csv(self, file_path:str, fields: List[str] = None):
        dataset = load_dataset("csv", data_files=file_path)["train"]
Collaborator

Just curious why we keep ["train"] here. I'm wondering whether anyone whose dataset isn't split into train/dev/test yet might want to use it.

Collaborator Author

The idea was to support a single CSV and create the splits from it. To support multiple CSVs (one per split), file_path would need to be a dict. It's something we can iterate on, though!
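
For illustration, here is a rough sketch of the two modes discussed above, written with only the standard library. It is not the code in this PR and does not go through `load_dataset`. The helper names (`read_csv`, `split_rows`, `load_csv_splits`) and the `seed` parameter are hypothetical, and the dict form of `file_path` is just one way the follow-up could look.

```python
import csv
import random
from typing import Dict, List, Union

Row = Dict[str, str]


def read_csv(path: str) -> List[Row]:
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def split_rows(rows: List[Row], train_size: float, dev_size: float,
               test_size: float, seed: int = 0) -> Dict[str, List[Row]]:
    """Shuffle rows, then cut them into train/dev/test by fraction."""
    if abs(train_size + dev_size + test_size - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1.0")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_size)
    n_dev = int(len(shuffled) * dev_size)
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        # test takes the remainder, so rounding never drops a row
        "test": shuffled[n_train + n_dev:],
    }


def load_csv_splits(file_path: Union[str, Dict[str, str]],
                    train_size: float = 0.5, dev_size: float = 0.2,
                    test_size: float = 0.3,
                    seed: int = 0) -> Dict[str, List[Row]]:
    """Hypothetical helper: a str is one file to split; a dict is one file per split."""
    if isinstance(file_path, dict):
        # Files are already organised by split: load each one as-is.
        return {name: read_csv(path) for name, path in file_path.items()}
    return split_rows(read_csv(file_path), train_size, dev_size, test_size, seed)
```

With a 10-row CSV and the 0.5 / 0.2 / 0.3 sizes from the snippet at the top, `load_csv_splits("data.csv")` yields 5, 2, and 3 rows. Passing something like `{"train": "a.csv", "dev": "b.csv"}` skips the splitting step and keeps each file as its own split.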

Collaborator

@arnavsinghvi11 arnavsinghvi11 left a comment

Looks good, @krypticmouse! Really neat to have widely supported data loading!

I actually wonder whether we can deprecate hotpotqa and refactor it onto this new DataLoader logic to keep everything clean. (Not needed, though; it's good as is too!)

@krypticmouse krypticmouse merged commit 7ea31fe into main Feb 13, 2024
1 check passed